Average Path Length in Complex Networks: 
Patterns and Predictions 



Reginald D. Smith 

Bouchet-Franklin Research Institute, P.O. Box 10051, Rochester, NY 14610 
E-mail: rsmith@sloan.mit.edu 

Abstract. A simple and accurate relationship is demonstrated that links the 
average shortest path, nodes, and edges in a complex network. This relationship 
takes advantage of the concept of link density and shows a large improvement 
in fitting networks of all scales over the typical random graph model. The 
relationships herein can allow researchers to better predict the shortest path of 
networks of almost any size. 

The study of complex networks has exploded over the past decade, with thousands 
of papers describing and theorizing about such networks in great detail. This 
explosion of research followed the widespread availability of large network 
databases, aided by advances in computer technology and by online applications 
used by millions of users. Among the most prominent and well-known studies have 
been those of the Internet [1], metabolic pathways [2], and scientific 
collaborations [3] [4]. Other networks studied include sexual contacts [5], instant 
messaging [6], Congressional committees [7], jazz musicians [8], blogs [9], 
airports [10], and rappers [11]. 

Several review articles have highlighted the main features and characteristics of 
complex networks [12] [13] [14]. One of the most studied and important features of a 
complex network has been found to be the average path length (or characteristic 
path length), l. It describes the average number of links that form the shortest 
path between any two nodes in the network. This property, more than any other, gives 
rise to what is known as "small world" behavior. 
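
As an illustration of this definition, the following is a minimal sketch that 
computes the average path length of a small unweighted graph by breadth-first 
search. The function names and the adjacency-dictionary representation are 
assumptions made for this example and do not come from any of the cited studies. 

```python
from collections import deque

def average_path_length(adjacency):
    """Mean shortest-path length over ordered pairs of distinct, reachable nodes.

    `adjacency` maps each node to an iterable of its out-neighbours.
    """
    total, pairs = 0, 0
    for source in adjacency:
        # Breadth-first search yields hop counts from `source` in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("nan")

def complete_graph(n):
    return {i: [j for j in range(n) if j != i] for i in range(n)}

def directed_ring(n):
    return {i: [(i + 1) % n] for i in range(n)}

def undirected_ring(n):
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

print(average_path_length(complete_graph(5)))    # every pair is one hop apart
print(average_path_length(directed_ring(10)))    # equals N/2
print(average_path_length(undirected_ring(10)))  # approaches N/4 for large N
```

Running one search per node costs on the order of N(N + E) operations, so this 
sketch is only practical for small graphs. 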



1. Brief Properties of l 

In their seminal work that helped ignite research into small world phenomena, Watts 
and Strogatz [15] describe small world networks as those which are connected, where 
the number of nodes is much larger than the average degree per node, and where the 
average path length scales with log N. Though random graphs can exhibit small world 
behavior, most graphs in the real world are not random and are often distinguished 
from random graphs by a relatively high degree of clustering among nodes, as measured 
by the clustering coefficient. Watts and Strogatz also described an estimate from 
random graph theory for the average path length of a random graph, which has become 
very useful for comparison with real networks, 

$$ l \approx \frac{\ln N}{\ln \langle k \rangle} \qquad (1) $$ 

where N is the number of nodes and ⟨k⟩ is the average degree per node in the 
network, which is E/N for directed networks and 2E/N for undirected networks, where 
E is the number of links (edges) in the graph. This equation is not exact, but it 
gives a good rough estimate for many networks. As an approximation, however, it is 
usually used only to compare the average path length of a graph built from real or 
simulated data with that of a random graph having the same N and ⟨k⟩. 
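
The random graph estimate can be sketched in a few lines. The function names and 
the example graph (1,000 nodes, 5,000 undirected edges) are hypothetical and chosen 
only to make the arithmetic easy to follow. 

```python
import math

def mean_degree(n_nodes, n_edges, directed=False):
    """Average degree <k>: E/N for directed graphs, 2E/N for undirected ones."""
    return n_edges / n_nodes if directed else 2 * n_edges / n_nodes

def random_graph_path_length(n_nodes, n_edges, directed=False):
    """Random-graph estimate l ~ ln N / ln <k>; only meaningful for <k> > 1."""
    k = mean_degree(n_nodes, n_edges, directed)
    if k <= 1:
        raise ValueError("estimate needs an average degree above 1")
    return math.log(n_nodes) / math.log(k)

# Hypothetical example: <k> = 2 * 5000 / 1000 = 10, so l is about ln 1000 / ln 10 = 3.
print(random_graph_path_length(1000, 5000))
```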

There has been much more work on l describing its theoretical relationship with 
the small world network it characterizes [16] [17] [18] [19] [20]. Small-world 
networks have been analyzed using percolation theory and mean field theory, among 
others, to understand the exact nature of the transition from a "large" to a small 
world network. Since l is one of the key parameters that signifies such a change, 
its theoretical relationship to other properties of complex networks, such as the 
correlation length of the network, has been investigated. 



2. Link Density and Average Path Length 

Though the random graph approximation is useful, it can be asked whether there is a 
better model for complex networks that can explain known data. Complex networks, 
despite having similar average path lengths or clustering coefficients, can vary in 
other measures such as first order degree distributions, assortative (or 
disassortative) mixing, and sizes of connected components. Given the many important 
topological features, let alone the feedback from the dynamics on the network that 
affects network evolution, it can be questioned whether any more precise 
generalization is possible among complex networks. 

One key concept that can link many disparate graphs, regardless of their number of 
nodes, is network density. Network density has been described in some papers 
[21] [22]. The definition used here is the ratio of the number of edges in the 
network to the total possible number of edges in the complete graph, 

$$ a = \frac{2E}{N(N-1)} \qquad (2) $$ 

The network density has a maximum of 1 in a complete graph and a minimum 
of 2/(N - 1) ≈ 2/N in a simple ring topology. This density is also identical to the 
value of p, the probability that two nodes are connected by a link. However, it will 
be referred to as density throughout this paper since this paper does not concentrate 
on aspects of probability or percolation theory. This density will be used for both 
directed and undirected networks. For simplification, since the number of nodes in a 
network is usually N ≫ 1, the link density can also be approximated as 

$$ a \approx \frac{2E}{N^2} \qquad (3) $$ 

However, the link density does not directly correlate with the average path length. 
Plotting a against l shows very little relationship. Part of the problem is that in 
increasingly larger networks, a lower link density is sufficient to obtain a given 
average path length. In general, a larger network has a much smaller a for a given l 
than a smaller network with a similar l. 

Though the relationship between the two variables is tenuous, their product has 
several interesting properties 

$$ D = a l \qquad (4) $$ 

D is equal to 1 for both complete graphs (E = N^2/2, l = 1) and directed ring 
topologies (E = N, l = N/2), which among strongly connected graphs are the networks 
with the longest possible average path. Undirected ring topologies have l = N/4. 
Outside these two extreme cases, however, D typically does not equal 1 but takes a 
much lower value greater than 0. The value of D varies strongly with a, and so 
depends heavily on the size of the network, with l having a minimal impact. 
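
The extreme cases above can be checked with a short sketch. It assumes the large-N 
density 2E/N^2 and an edge count of E = N for both ring topologies; the function 
names are illustrative only. 

```python
def link_density(n_nodes, n_edges):
    """Approximate link density a = 2E / N^2 (the large-N form)."""
    return 2 * n_edges / n_nodes ** 2

def density_path_product(n_nodes, n_edges, path_length):
    """D = a * l, the product of link density and average path length."""
    return link_density(n_nodes, n_edges) * path_length

n = 1000
# Complete graph: E = N^2 / 2 and l = 1.
d_complete = density_path_product(n, n ** 2 / 2, 1)
# Directed ring: E = N and l = N / 2.
d_directed_ring = density_path_product(n, n, n / 2)
# Undirected ring: E = N (assumed) and l = N / 4.
d_undirected_ring = density_path_product(n, n, n / 4)
print(d_complete, d_directed_ring, d_undirected_ring)
```

Under these assumptions the complete graph and the directed ring both give D = 1, 
while the undirected ring gives D = 1/2. 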

A way to resolve the issue is to normalize the network density so that it is 
comparable across networks of all sizes. I define a normalized network density, 
a_s, by taking the logarithm of the network density with a base of N/2 and adding 1, 
which is equivalent to 

$$ a_s = 1 + \frac{\log a}{\log (N/2)} \qquad (5) $$ 

where log here designates the natural logarithm. When the network is a complete 
network the normalized network density is 1, while for a ring topology it equals 0. 
Therefore, the size of the network does not affect the minimum network density. 
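
A minimal sketch of the normalized density follows, checking the two limiting 
values for several network sizes. The function name is an assumption, and the ring 
density is taken as the large-N value 2/N. 

```python
import math

def normalized_density(density, n_nodes):
    """a_s = 1 + log(a) / log(N / 2), i.e. 1 plus the base-N/2 logarithm of a."""
    if not 0 < density <= 1:
        raise ValueError("density must lie in (0, 1]")
    if n_nodes <= 2:
        raise ValueError("the base N/2 must exceed 1")
    return 1 + math.log(density) / math.log(n_nodes / 2)

for n in (100, 10_000, 1_000_000):
    complete = normalized_density(1.0, n)   # complete graph: a = 1
    ring = normalized_density(2 / n, n)     # ring: a is roughly 2/N
    print(n, complete, round(ring, 12))
```

The complete graph gives 1 and the ring gives 0 (up to floating-point error) for 
every N tried, which is the size independence the normalization is meant to provide. 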

Figure 1. Plot of 1/log l vs. normalized network density a_s. The least squares fit 
has a slope of 1.5 and an intercept of 0.4, with R^2 of 0.78. 

Figure 2. Plot of the same network data with l vs. ln N / ln⟨k⟩. The data fit 
relatively well for graphs with small average path lengths or small N but show 
greater disparities when these conditions are not met. 

Not only does this normalized network density allow network densities to be compared 
across networks of various sizes, it also correlates with the path length of the 
network, in particular with the inverse of log l. 

The graph in Figure 1 was developed using data from 39 different networks 
described in various papers. The values for these networks are shown in Appendix I. 
These networks are of many different types and have been given broad categorizations 
following those used by Newman [12]. 

This relationship was discovered empirically and was not quite expected. In fact, 
one of its more interesting properties is how it fits disparate networks of all 
scales and average path lengths, accommodating the data better than the random graph 
estimation in Figure 2. 

Figure 3. Plot for a graph of 1,000 nodes with 1,100 to 20,000 edges, where 
l = ln N / ln⟨k⟩. 

In fact, in Figure 1 the main points that do not fit the least squares line well 
are those from biological networks, including the food webs (freshwater and marine) 
and metabolic networks. This may indicate either that the data on these networks are 
incomplete or that the underlying organizational property driving this relationship 
is less active in biological networks. The relation derived from the regressions 
implies 

$$ m a_s + C = \frac{1}{\log l} \qquad (6) $$ 

This allows us to relate l to the normalized density with the equation 

$$ l = e^{\frac{1}{m a_s + C}} \qquad (7) $$ 

A quick but interesting example can be made using equation (7). Assuming the 
US population is 300M and accepting Milgram's six degrees of separation (l = 6), we 
can estimate the average degree ⟨k⟩ for the US population at 14.6. This is much less 
than the 25.9 estimated from random graph theory (and implies that we are 
substantially less gregarious). 
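
The arithmetic behind this example can be sketched as follows. The sketch assumes 
the rounded fit parameters from the Figure 1 caption (slope 1.5, intercept 0.4), 
the large-N density 2E/N^2, and an undirected acquaintance network so that 
⟨k⟩ = 2E/N; the function names are illustrative. 

```python
import math

# Assumed fit parameters, rounded as in the Figure 1 caption.
M_SLOPE = 1.5
C_INTERCEPT = 0.4

def predicted_path_length(a_s, m=M_SLOPE, c=C_INTERCEPT):
    """Path length from normalized density: l = exp(1 / (m * a_s + C))."""
    return math.exp(1 / (m * a_s + c))

def mean_degree_from_path_length(n_nodes, path_length, m=M_SLOPE, c=C_INTERCEPT):
    """Invert the fit to recover <k> from N and l."""
    a_s = (1 / math.log(path_length) - c) / m      # from m * a_s + C = 1 / log l
    log_a = (a_s - 1) * math.log(n_nodes / 2)      # from a_s = 1 + log a / log(N/2)
    density = math.exp(log_a)
    return density * n_nodes                       # a ~ 2E/N^2 and <k> = 2E/N

def random_graph_mean_degree(n_nodes, path_length):
    """Invert l = ln N / ln <k>, giving <k> = N ** (1 / l)."""
    return n_nodes ** (1 / path_length)

n_us, l_us = 300e6, 6
k_density = mean_degree_from_path_length(n_us, l_us)
k_random = random_graph_mean_degree(n_us, l_us)
print(round(k_density, 2), round(k_random, 2))
```

With these rounded parameters the density-based estimate comes out near 14.5, 
close to the 14.6 quoted above; the small gap presumably reflects rounding of the 
fit parameters. The random graph estimate comes out near 25.9. 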

A key question about the relationship in Figure 1 is how widely it applies to 
all types of networks. All of the networks sampled are described by their authors as 
having "scale-free" or "long-tailed" characteristics. Graph theory does not constrain 
a network to be of this type, so by examining the relationship using data from an 
artificial random graph we can begin to probe the boundaries of its applicability. 

In Figure 3 it is clearly visible that the linear relation among real networks also 
holds for random graph data. 

The slope of the plot from a random graph approximation is 2.04 which is slightly 
steeper than the slope of data from real networks. Therefore, this relationship is likely 
widely held among many small-world networks with a variety of topologies though 
there are likely exceptions. 
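
A rough sketch of this kind of check is given below. It generates points from the 
random graph approximation for a 1,000-node graph and fits a least squares line to 
1/log l against the normalized density. The edge-count sampling (steps of 100) and 
the undirected convention ⟨k⟩ = 2E/N are assumptions, so the fitted slope need not 
reproduce the value of 2.04 reported above. 

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def random_graph_points(n_nodes, edge_counts):
    """(a_s, 1/log l) pairs, with l taken from the estimate ln N / ln <k>."""
    xs, ys = [], []
    for e in edge_counts:
        k = 2 * e / n_nodes                    # assumed undirected: <k> = 2E/N
        l = math.log(n_nodes) / math.log(k)    # random graph estimate of l
        a = 2 * e / n_nodes ** 2               # large-N link density
        xs.append(1 + math.log(a) / math.log(n_nodes / 2))  # normalized density
        ys.append(1 / math.log(l))
    return xs, ys

# Assumed sampling: edge counts from 1,100 to 20,000 in steps of 100.
xs, ys = random_graph_points(1000, range(1100, 20001, 100))
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Because the curve traced by the approximation is not exactly straight, the fitted 
slope depends on how the edge counts are spaced and on the degree convention used. 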

3. Discussion 

First, it should be acknowledged that though this relationship seems to fit a wider 
variety of real networks than random graph theory, it is not perfect. By the 
standards of theoretical prediction, the statistical fit still allows much leeway in 
the relationship between the quantities plotted against each other. However, this 
relationship does fit data more consistently over all size scales in real networks 
than the usual random graph theory treatment. 

Despite the interesting relationship these data reveal, they also raise the question 
of what the parameters of the linear plot actually mean. One clue can be gleaned 
from looking at the rate of change of the average shortest path with respect to the 
normalized link density, dl/da_s. 

From equation (6) one can easily deduce that 

$$ \frac{dl}{da_s} = -m l [\log l]^2 \qquad (8) $$ 

So the slope, m, can be seen as the constant of proportionality between the rate of 
change of the average shortest path with increasing network density and the quantity 
l [log l]^2, which depends only on the average shortest path. In fact, equation (8) 
gives quite intuitive solutions since m > 0. The larger l is, the more rapidly you 
can reduce the average shortest path of the network by increasing the network 
density. This intuitively fits with the observation by Watts and Strogatz [15] that 
in relatively sparse topologies, shortcuts can drastically reduce the average path 
length of the network, leading to small world behavior. As the network becomes more 
dense, such shortcuts give incrementally smaller reductions in the network average 
path. When you reach a complete network at l = 1 there is a fixed point, given that 
you have maximum density and can no longer reduce the diameter of the network. 
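
For completeness, the step from equation (6) to equation (8) can be written out as 
below. This is a sketch that treats m and C as constants and log as the natural 
logarithm, as elsewhere in the paper. 

```latex
\begin{align*}
\frac{1}{\log l} &= m a_s + C
  && \text{equation (6)} \\
\frac{d}{da_s}\left(\frac{1}{\log l}\right)
  &= -\frac{1}{l\,[\log l]^2}\,\frac{dl}{da_s} = m
  && \text{chain rule on the left, derivative of the right} \\
\frac{dl}{da_s} &= -m\,l\,[\log l]^2
  && \text{equation (8)}
\end{align*}
```

At l = 1 the factor log l vanishes, so the derivative is zero, which is the fixed 
point described above. 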

Additionally, m could be a measure of some quantity such as the "mass" of the 
network. If log l can somehow be seen as a length, then this gives additional meaning 
to the normalized link density. Here m would be a characteristic mass in all networks 
that is distributed over a one-dimensional interval determined by log l, and a_s 
measures the resultant length density. However, the question of what value m really 
takes is still unanswered. The fact that it seems consistent across such a wide 
variety of networks suggests it is some constant, perhaps a transcendental number or 
ratio. This is all speculative, though. Until a firm theoretical underpinning for the 
above results is found, the exact value of m remains subject to speculation. 

In addition, although this relation seems to hold across a wide variety of networks, 
there are obviously situations where equations such as equation (7) break down. For 
example, when a_s is 1 (a complete graph), l should have a value of 1; however, this 
does not necessarily follow from the relations shown here. The only exception is 
equation (8), which shows a fixed point at l = 1 as expected. Therefore, in the 
regions of nearly complete graphs or sparse graphs, possibly where p < p_c, where 
p_c is the critical probability from percolation theory, this relationship does not 
reliably apply. However, these regions are not the domain of almost all real 
networks. 

4. Conclusion 

This paper has shown that there is an intrinsic relationship between the average 
path length l and the normalized link density, which depends on the number of nodes 
and edges, that is present in all networks. This relationship fits real networks, 
which often have a scale-free or non-random character, and can also describe random 
networks and likely most small world networks. Given the breakdown of the theory 
near the complete and ring topologies, it may be surmised that it applies only to 
graphs with small-world character that have a link probability p_c < p < 1. Much 
more research is needed, however, to determine the exact reason for this 
relationship and, in particular, the meaning of the parameter m. 